53 research outputs found

    Towards Structural Classification of Proteins based on Contact Map Overlap

    A multitude of measures have been proposed to quantify the similarity between protein 3-D structures. Among these measures, contact map overlap (CMO) maximization has received sustained attention during the past decade because it offers a fine estimation of the natural homology relation between proteins. Despite this large involvement of the bioinformatics and computer science community, the performance of known algorithms remains modest. Due to the complexity of the problem, they get stuck on relatively small instances and are not applicable to large-scale comparison. This paper offers a clear improvement over past methods in this respect. We present a new integer programming model for CMO and propose an exact branch-and-bound (B&B) algorithm with bounds computed by solving a Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, and also provides lower and upper bounds of better quality. Some hard CMO instances have been solved for the first time and within reasonable time limits. From the values of the running time and the relative gap (relative difference between upper and lower bounds), we obtained the right classification for this test. These encouraging results led us to design a harder benchmark to better assess the classification capability of our approach. We constructed a large-scale set of 300 protein domains (a subset of the ASTRAL database) that we have called Proteus 300. Using the relative gap of any of the 44850 pairs as a similarity measure, we obtained a classification in very good agreement with SCOP. Our algorithm thus provides a powerful classification tool for large structure databases.
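
As a rough illustration of the quantities named in the abstract above, the sketch below scores one fixed alignment by counting the contacts it maps onto each other, and computes a relative gap between two bounds. It is not the paper's B&B or Lagrangian relaxation. The function names, the toy contact lists, and the choice to normalize the gap by the upper bound are assumptions made for this example.

```python
def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of protein A that an alignment maps onto contacts of protein B.

    contacts_a, contacts_b: iterables of (i, j) residue-index pairs (hypothetical toy input).
    alignment: dict mapping residue indices of A to residue indices of B.
    """
    # CMO alignments are order-preserving: sorted A indices must map to increasing B indices.
    pairs = sorted(alignment.items())
    if any(b1 >= b2 for (_, b1), (_, b2) in zip(pairs, pairs[1:])):
        raise ValueError("alignment must be one-to-one and order-preserving")
    contacts_b_set = {frozenset(c) for c in contacts_b}
    return sum(
        1
        for i, j in contacts_a
        if i in alignment
        and j in alignment
        and frozenset((alignment[i], alignment[j])) in contacts_b_set
    )


def relative_gap(lower, upper):
    """Relative difference between bounds; dividing by the upper bound is an assumption."""
    return (upper - lower) / upper if upper else 0.0


# Toy example: two of the three contacts of A land on contacts of B.
contacts_a = [(0, 2), (1, 3), (2, 4)]
contacts_b = [(0, 2), (1, 4), (3, 5)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4, 4: 5}
overlap = contact_overlap(contacts_a, contacts_b, alignment)

# Number of unordered pairs among 300 domains, as quoted in the abstract.
n_pairs = 300 * 299 // 2
```

An exact solver searches over all order-preserving alignments for the one maximizing this count; the sketch only evaluates a single candidate.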

    Solving Maximum Clique Problem for Protein Structure Similarity

    A basic assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a common ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated. Evaluating the similarity of two proteins can be done by finding an optimal one-to-one matching between their components, which is equivalent to identifying a maximum weighted clique in a specific "alignment graph". In this paper we present a new integer programming formulation for solving such clique problems. The model has been implemented using the ILOG CPLEX Callable Library. In addition, we designed a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST (Vector Alignment Search Tool), a software tool for aligning protein 3D structures that is widely used at NCBI (National Center for Biotechnology Information). The original VAST clique solver uses the well-known Bron and Kerbosch algorithm (BK). Our computational results on real-life protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK for the largest proteins.
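
For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of the textbook Bron and Kerbosch enumeration with pivoting, used to keep the largest clique found. It is neither the VAST implementation nor the paper's branch and bound algorithm. The function name `max_clique` and the toy graph are assumptions for illustration.

```python
def max_clique(adj):
    """Return one maximum-cardinality clique of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours (hypothetical input format).
    """
    best = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: vertices already explored.
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest seen so far.
            if len(r) > len(best):
                best = list(r)
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)


# Toy graph: a triangle {0, 1, 2} with a path 2-3-4 attached.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
clique = max_clique(graph)
```

The enumeration visits every maximal clique, which is why a dedicated bounding scheme can pay off on large alignment graphs.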

    Functional geometry of protein interactomes

    Motivation: Protein–protein interactions (PPIs) are usually modeled as networks. These networks have extensively been studied using graphlets, small induced subgraphs capturing the local wiring patterns around nodes in networks. They revealed that proteins involved in similar functions tend to be similarly wired. However, such simple models can only represent pairwise relationships and cannot fully capture the higher-order organization of protein interactomes, including protein complexes. Results: To model the multi-scale organization of these complex biological systems, we utilize simplicial complexes from computational geometry. The question is how to mine these new representations of protein interactomes to reveal additional biological information. To address this, we define simplets, a generalization of graphlets to simplicial complexes. By using simplets, we define a sensitive measure of similarity between simplicial complex representations that allows for clustering them according to their data types better than clustering them by using other state-of-the-art measures, e.g. spectral distance, or facet distribution distance. We model human and baker’s yeast protein interactomes as simplicial complexes that capture PPIs and protein complexes as simplices. On these models, we show that our newly introduced simplet-based methods cluster proteins by function better than the clustering methods that use the standard PPI networks, uncovering the new underlying functional organization of the cell.
We demonstrate the existence of the functional geometry in the protein interactome data and the superiority of our simplet-based methods to effectively mine for new biological information hidden in the complexity of the higher-order organization of protein interactomes. This work was supported by the European Research Council (ERC) Starting Independent Researcher Grant 278212, the European Research Council (ERC) Consolidator Grant 770827, the Serbian Ministry of Education and Science Project III44006, the Slovenian Research Agency project J1-8155 and the awards to establish the Farr Institute of Health Informatics Research, London, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust (grant MR/K006584/1). Peer reviewed. Postprint (author's final draft).

    Shape Matching by Localized Calculations of Quasi-isometric Subsets, with Applications to the Comparison of Protein Binding Patches

    Given a protein complex involving two partners, the receptor and the ligand, this paper addresses the problem of comparing their binding patches, i.e. the sets of atoms accounting for their interaction. This problem has been classically addressed by searching quasi-isometric subsets of atoms within the patches, a task equivalent to a maximum clique problem, an NP-hard problem, so that practical binding patches involving up to 300 atoms cannot be handled. We extend previous work in two directions. First, we present a generic encoding of shapes represented as cell complexes. We partition a shape into concentric shells, based on the shelling order of the cells of the complex. The shelling order yields a shelling tree encoding the geometry and the topology of the shape. Second, for the particular case of cell complexes representing protein binding patches, we present three novel shape comparison algorithms. These algorithms combine a Tree Edit Distance calculation (TED) on shelling trees, together with edit operations respectively favoring a topological or a geometric comparison of the patches. We show in particular that the geometric TED calculation strikes a balance, in terms of accuracy and running time, between purely geometric and purely topological comparisons, and we briefly comment on the biological findings reported in a companion paper.

    Comparing Protein 3D Structures Using A_purva

    Structural similarity between proteins provides significant insights about their functions. Contact Map Overlap maximization (CMO) has received sustained attention during the past decade and can be considered today as a credible protein structure similarity measure. We present here A_purva, an exact CMO solver that is both efficient (notably faster than the previous exact algorithms) and reliable (providing accurate upper and lower bounds of the solution). These properties make it applicable for large-scale protein comparison and classification. Availability: http://apurva.genouest.org Contact: [email protected] Supplementary information: A_purva's user manual, as well as many examples of protein contact maps, can be found on A_purva's web page.

    Identifying cellular cancer mechanisms through pathway-driven data integration

    Motivation: Cancer is a genetic disease in which accumulated mutations of driver genes induce a functional reorganization of the cell by reprogramming cellular pathways. Current approaches identify cancer pathways as those most internally perturbed by gene expression changes. However, driver genes characteristically perform hub roles between pathways. Therefore, we hypothesize that cancer pathways should be identified by changes in their pathway–pathway relationships. Results: To learn an embedding space that captures the relationships between pathways in a healthy cell, we propose pathway-driven non-negative matrix tri-factorization. In this space, we determine condition-specific (i.e. diseased and healthy) embeddings of pathways and genes. Based on these embeddings, we define our ‘NMTF centrality’ to measure a pathway’s or gene’s functional importance, and our ‘moving distance’, to measure the change in its functional relationships. We combine both measures to predict 15 genes and pathways involved in four major cancers, predicting 60 gene–cancer associations in total, covering 28 unique genes. To further exploit driver genes’ tendency to perform hub roles, we model our network data using graphlet adjacency, which considers nodes adjacent if their interaction patterns form specific shapes (e.g. paths or triangles). We find that the predicted genes rewire pathway–pathway interactions in the immune system and provide literature evidence that many are druggable (15/28) and implicated in the associated cancers (47/60). We predict six druggable cancer-specific drug targets. This work was supported by the European Research Council (ERC) Consolidator Grant 770827 and the Spanish State Research Agency AEI 10.13039/501100011033 [grant number PID2019-105500GB-I00]. Peer reviewed. Postprint (published version).

    Graphlet eigencentralities capture novel central roles of genes in pathways

    Motivation: Graphlet adjacency extends regular node adjacency in a network by considering a pair of nodes being adjacent if they participate in a given graphlet (small, connected, induced subgraph). Graphlet adjacencies captured by different graphlets were shown to contain complementary biological functions and cancer mechanisms. To further investigate the relationships between the topological features of genes participating in molecular networks, as captured by graphlet adjacencies, and their biological functions, we build more descriptive pathway-based approaches. Contribution: We introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality, to identify pathways and cancer mechanisms described by a given graphlet adjacency. We compute the centrality of genes in a pathway either from the local perspective of the pathway or from the global perspective of the entire network. Results: We show that in molecular networks of human and yeast, different local graphlet adjacencies describe different pathways (i.e. all the genes that are functionally important in a pathway are also considered topologically important by their local graphlet eigencentrality). Pathways described by the same graphlet adjacency are functionally similar, suggesting that each graphlet adjacency captures different pathway topology and function relationships. Additionally, we show that different graphlet eigencentralities describe different cancer driver genes that play central roles in pathways, or in the crosstalk between them (i.e. we can predict cancer driver genes participating in a pathway by their local or global graphlet eigencentrality).
This result suggests that by considering different graphlet eigencentralities, we can capture different functional roles of genes in and between pathways. This study received support from the following sources: The European Research Council (ERC) Consolidator Grant 770827 (awarded to NP); The Spanish State Research Agency AEI 10.13039/501100011033 grant number PID2019-105500GB-I00 (awarded to NP); and University College London Computer Science (awarded to SW). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Peer reviewed. Postprint (published version).
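
To make the notion of graphlet adjacency in the abstract above more concrete, the sketch below handles only the simplest case, the triangle graphlet: two nodes are triangle-adjacent if some triangle contains both, and the entry counts how many triangles they share. This is an illustrative reading of the definition quoted above, not the authors' code, and it does not compute the eigencentrality itself. The function name and the toy network are assumptions.

```python
def triangle_adjacency(adj):
    """Count, for each connected node pair, the triangles that contain both nodes.

    adj: dict mapping each node to the set of its neighbours (hypothetical input format).
    Returns a dict {(u, v): count} with u < v, listing only pairs that share a triangle.
    """
    counts = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Every common neighbour w closes a triangle u-v-w.
                shared = len(adj[u] & adj[v])
                if shared:
                    counts[(u, v)] = shared
    return counts


# Toy network: triangles {0, 1, 2} and {1, 2, 3}, plus a pendant edge 3-4.
network = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
tri_adj = triangle_adjacency(network)
```

In this toy network the edge 3-4 is an ordinary adjacency but not a triangle adjacency, which shows how a graphlet-specific adjacency can single out a different set of node relationships than the plain network does.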

    Characterizing the Morphology of Protein Binding Patches

    Let the patch of a partner in a protein complex be the collection of atoms accounting for the interaction. To improve our understanding of the structure-function relationship, we present a patch model decoupling the topological and geometric properties. While the geometry is classically encoded by the atomic positions, the topology is recorded in a graph encoding the relative position of concentric shells partitioning the interface atoms. The topological-geometric duality provides the basis of a generic dynamic programming-based algorithm comparing patches at the shell level, which may favor topological or geometric features. On the biological side, we address four questions, using 249 cocrystallized heterodimers organized in biological families. First, we dissect the morphology of binding patches and show that Nature enjoyed the topological and geometric degrees of freedom independently while retaining a finite set of qualitatively distinct topological signatures. Second, we argue that our shell-based comparison is effective to perform atomic-level comparisons and show that topological similarity is less stringent than geometric similarity. We also use the topological versus geometric duality to exhibit topo-rigid patches, whose topology (but not geometry) remains stable upon docking. Third, we use our comparison algorithms to infer specificity-related information amidst a database of complexes. Finally, we exhibit a descriptor outperforming its contenders to predict the binding affinities of the affinity benchmark. The software developed with this article is available from http://team.inria.fr/abs/vorpatch_compatch/

    A functional analysis of omic network embedding spaces reveals key altered functions in cancer

    Motivation: Advances in omics technologies have revolutionized cancer research by producing massive datasets. A common approach to deciphering these complex data is to apply embedding algorithms to molecular interaction networks. These algorithms find a low-dimensional space in which similarities between the network nodes are best preserved. Currently available embedding approaches mine the gene embeddings directly to uncover new cancer-related knowledge. However, these gene-centric approaches produce incomplete knowledge, since they do not account for the functional implications of genomic alterations. We propose a new, function-centric perspective and approach, to complement the knowledge obtained from omic data. Results: We introduce our Functional Mapping Matrix (FMM) to explore the functional organization of different tissue-specific and species-specific embedding spaces generated by a Non-negative Matrix Tri-Factorization algorithm. Also, we use our FMM to define the optimal dimensionality of these molecular interaction network embedding spaces. For this optimal dimensionality, we compare the FMMs of the most prevalent cancers in human to FMMs of their corresponding control tissues. We find that cancer alters the positions in the embedding space of cancer-related functions, while it keeps the positions of the noncancer-related ones. We exploit this spatial ‘movement’ to predict novel cancer-related functions. Finally, we predict novel cancer-related genes that the currently available methods for gene-centric analyses cannot identify; we validate these predictions by literature curation and retrospective analyses of patient survival data. This project has received funding from the European Research Council (ERC) Consolidator Grant 770827 and the Spanish State Research Agency AEI 10.13039/501100011033 grant number PID2019-105500GB-I00. Peer reviewed. Postprint (published version).